
The Gapped-Factor Tree

Author: Allali Julien
Peterlongo Pierre
Sagot Marie-France
Publication venue: HAL CCSD
Publication date: 01/01/2006

We present a data structure to index a specific kind of factors (that is, substrings) called gapped-factors. A gapped-factor is a factor containing a gap that is ignored during indexing. The data structure is based on the suffix tree and indexes all the gapped-factors of a text with a fixed gap size, and only those. It is built online in linear time and space. Such a data structure may play an important role in various pattern matching and motif inference problems, for instance in text filtration.
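
To make the definition concrete, here is a minimal sketch that enumerates gapped-factors naively and stores them in a dictionary. It is not the suffix-tree-based construction of the paper and it is not linear time. The parameter names `left_len`, `gap`, and `right_len`, and the choice of fixed-length parts on both sides of the gap, are assumptions made for illustration only.

```python
from collections import defaultdict


def gapped_factors(text, left_len, gap, right_len):
    """Yield (position, left, right) for every gapped-factor of `text`.

    Each factor spans left_len + gap + right_len characters. The `gap`
    characters in the middle are skipped, so only `left` and `right` are kept.
    """
    span = left_len + gap + right_len
    for i in range(len(text) - span + 1):
        left = text[i:i + left_len]
        right = text[i + left_len + gap:i + span]
        yield i, left, right


def build_index(text, left_len, gap, right_len):
    """Map each (left, right) pair to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for pos, left, right in gapped_factors(text, left_len, gap, right_len):
        index[(left, right)].append(pos)
    return dict(index)


# Hypothetical example text: "AC..AC" occurs at positions 0 and 4,
# with different gap contents ("GT" and "CT") that the index ignores.
index = build_index("ACGTACCTAC", left_len=2, gap=2, right_len=2)
```

In this toy text, the two occurrences of the pair `("AC", "AC")` are grouped under one key even though the characters inside their gaps differ, which is the behavior the abstract describes.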

Lire les lectures : analyse de données de séquençage (Reading the Reads: Analysis of Sequencing Data)

Author: Peterlongo Pierre
Publication venue: HAL CCSD
Publication date: 25/01/2016

All the work presented in this HDR (habilitation thesis) concerns the exploitation of high-throughput sequencing data in the absence of a close, good-quality reference genome. In the first chapter, we propose new approaches for extracting biological variants of interest from such sequencing data. In the second chapter, we present methods for comparing sequencing datasets. Finally, in the third chapter, we propose a method that is a preliminary step toward better "assemblies" of such sequencing data.

Multiple Comparative Metagenomics using Multiset k-mer Counting

Author: Benoit Gaëtan
Drezen Erwan
Lavenier Dominique
Lemaitre Claire
Mariadassou Mahendra
Peterlongo Pierre
Schbath Sophie
Publication date: 28/04/2016

Background. Large-scale metagenomic projects aim to extract biodiversity knowledge across different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on the small subset of sequences that can be associated with known organisms. On the other hand, de novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide precise and exhaustive results. Methods. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts with k-mer counts. Simka scales up to today's metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Results. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute in a few hours both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads). We also demonstrate that results obtained by analyzing metagenomes at the k-mer level are highly correlated with those of extremely precise de novo comparison techniques, which rely on an all-versus-all sequence alignment strategy or on taxonomic profiling.
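
The idea of replacing species counts with k-mer counts can be sketched in a few lines. The example below is only an illustration, not Simka's parallel counting strategy: it counts k-mers per sample in memory and computes two standard ecological distances, Bray-Curtis (quantitative) and Jaccard (qualitative). The choice of these two distances, the function names, and the toy reads are assumptions, and both samples are assumed to be non-empty.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Quantitative distance: 1 - 2 * (shared abundance) / (total abundance)."""
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total


def jaccard_distance(a, b):
    """Qualitative distance: based on k-mer presence or absence only."""
    union = a.keys() | b.keys()
    return 1.0 - len(a.keys() & b.keys()) / len(union)


# Hypothetical toy samples; real inputs would be read sets from sequencing files.
sample_a = kmer_counts(["ACGT", "CGTA"], k=3)
sample_b = kmer_counts(["ACGT", "GTAC"], k=3)
print(bray_curtis(sample_a, sample_b), jaccard_distance(sample_a, sample_b))
```

The k-mer plays the role that a species plays in classical ecology: abundance-based distances use the counts, while presence/absence distances use only the set of distinct k-mers.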

Fast and Scalable Minimal Perfect Hashing for Massive Key Sets

Author: Chikhi Rayan
Limasset Antoine
Peterlongo Pierre
Rizk Guillaume
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 16th International Symposium on Experimental Algorithms (SEA 2017)
Publication date: 01/01/2017

Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of 10^{10} elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^{12}. Source code: https://github.com/rizkg/BBHas
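
The abstract does not spell out the algorithm, so the following is only a sketch of what a minimal perfect hash function guarantees: n distinct keys are mapped to the integers 0 to n-1 without collisions. It uses a cascade of bit arrays in which keys that collide at one level are retried at the next; treating this as the approach is an assumption, and the names `build_mphf`, `lookup`, `gamma`, and `max_levels` are hypothetical. It is not the BBhash C++ API and makes no attempt at space efficiency.

```python
import hashlib


def _slot(key, level, size):
    """Deterministic hash of `key`, seeded by `level`, reduced to [0, size)."""
    digest = hashlib.blake2b(f"{level}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % size


def build_mphf(keys, gamma=2.0, max_levels=64):
    """Build one bit array per level; keys must be distinct."""
    levels = []
    remaining = list(keys)
    for level in range(max_levels):
        if not remaining:
            break
        size = max(1, int(gamma * len(remaining)))
        hits = [0] * size
        for key in remaining:
            hits[_slot(key, level, size)] += 1
        # A slot is claimed only when exactly one remaining key lands on it.
        bits = [count == 1 for count in hits]
        levels.append(bits)
        remaining = [k for k in remaining if not bits[_slot(k, level, size)]]
    if remaining:
        raise RuntimeError("some keys could not be placed; raise max_levels")
    return levels


def lookup(levels, key):
    """Return the index of `key`: the rank of its claimed bit across all levels."""
    offset = 0
    for level, bits in enumerate(levels):
        slot = _slot(key, level, len(bits))
        if bits[slot]:
            return offset + sum(bits[:slot])
        offset += sum(bits)
    raise KeyError(key)


keys = [f"key{i}" for i in range(1000)]
levels = build_mphf(keys)
values = sorted(lookup(levels, key) for key in keys)
```

For keys in the input set, `values` is exactly `0, 1, ..., n-1`. A key outside the set may raise `KeyError` or silently receive the index of another key, which is the usual behavior of a hash function built for a static set.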

Mapsembler, targeted assembly of larges genomes on a desktop computer

Author: Chikhi Rayan
Peterlongo Pierre
Publication venue: HAL CCSD
Publication date: 16/03/2011

Background: The analysis of next-generation sequencing data from large genomes is a timely research topic. Sequencers are producing billions of short sequence fragments from newly sequenced organisms. Computational methods for reconstructing sequences (whole-genome assemblers) are typically employed to process such data. However, one of the main drawbacks of these methods is their high memory requirement. Results: We present Mapsembler, an iterative targeted assembler which processes large datasets of reads on commodity hardware. Mapsembler checks for the presence of given regions of interest in the reads and reconstructs their neighborhood, either as a plain sequence (consensus) or as a graph (full sequence structure). We introduce new algorithms to retrieve homologues of a sequence from reads and construct an extension graph. Conclusions: Mapsembler is the first software that enables de novo discovery around a region of interest of gene homologues, SNPs, exon skipping as well as other structural events, directly from raw sequencing reads. Compared to traditional assembly software, the memory requirement and execution time of Mapsembler are considerably lower, as data indexing is localized. Mapsembler can be used at http://mapsembler.genouest.or

BlastGraph: intensive approximate pattern matching in string graphs and de-Bruijn graphs

Author: Holley Guillaume
Peterlongo Pierre
Publication venue: HAL CCSD
Publication date: 27/08/2012

Many de novo assembly tools have been created in the last few years to assemble short reads generated by high-throughput sequencing platforms. The core of almost all these assemblers is a string graph data structure that links reads together. This motivates our work: BlastGraph, a new algorithm performing intensive approximate string matching between a set of query sequences and a string graph. Our approach is similar to BLAST-like algorithms, with additional specifics due to matching on a graph data structure. Our results show that BlastGraph is fast enough to be used on large graphs in reasonable time. We propose a Cytoscape plug-in for visualizing results as well as a command line program. These programs are available at http://alcovna.genouest.org/blastree/

BGREAT: A De Bruijn graph read mapping tool

Author: Limasset Antoine
Peterlongo Pierre
Publication venue: HAL CCSD
Publication date: 06/07/2015

Mapping reads on references is a central task in numerous genomic studies. Since references are mainly extracted from assembly graphs, it is of high interest to map efficiently on such structures. The problem of mapping sequences on a de Bruijn graph has been shown to be NP-complete [1], and no scalable generic tool exists yet. We motivate here the problem of mapping reads on a de Bruijn graph and we present a practical solution and its implementation, called BGREAT. BGREAT handles real-world instances of billions of reads with moderate resources. Mapping on a de Bruijn graph makes it possible to keep the whole genomic information and to get rid of possible assembly mistakes. However, the problem is theoretically hard to handle on real-world datasets. Using a set of heuristics, our proposed tool is able to map millions of reads per CPU hour, even on complex human genomes. BGREAT is available at github.com/Malfoy/BGREAT

[1] Limasset, A., & Peterlongo, P. (2015). Read Mapping on de Bruijn graph. arXiv preprint arXiv:1505.04911.
[2] Langmead, B., et al. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
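
As a rough illustration of what mapping a read on a de Bruijn graph means in the BGREAT abstract, the sketch below builds the node set of a de Bruijn graph (all k-mers of the input sequences) and tests whether a read spells a path in it. This is a deliberately simplified, exact-match version: it is not BGREAT's method, it ignores sequencing errors and reverse complements, and the function names and toy sequences are assumptions.

```python
def build_de_bruijn_nodes(sequences, k):
    """Return the set of k-mers (graph nodes) found in the input sequences."""
    nodes = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            nodes.add(seq[i:i + k])
    return nodes


def maps_exactly(read, nodes, k):
    """True if every k-mer of the read is a node of the graph.

    Consecutive k-mers of a read overlap by k - 1 characters, so when all of
    them are nodes, the read spells a path in the de Bruijn graph.
    """
    if len(read) < k:
        return False
    return all(read[i:i + k] in nodes for i in range(len(read) - k + 1))


# Hypothetical toy input: two sequences that share the k-mer "CGT".
nodes = build_de_bruijn_nodes(["ACGTT", "CGTAG"], k=3)
print(maps_exactly("ACGTAG", nodes, k=3))  # path switching sequences at "CGT"
print(maps_exactly("ACGTC", nodes, k=3))   # "GTC" is not a node
```

The read `ACGTAG` is a substring of neither input sequence, yet it maps on the graph because it follows a path that switches from one sequence to the other at the shared k-mer. This shows how a graph can represent sequences that a single linear reference would not contain.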